ROI Baseball Analytics

Author

Brock Ellis

import pandas as pd 
import numpy as np
import sqlite3
from lets_plot import *
import statsmodels.formula.api as smf

LetsPlot.setup_html(isolated_frame=True)
sqlite_file = 'lahman_1871-2022.sqlite'
con = sqlite3.connect(sqlite_file)
tables = pd.read_sql_query("SELECT name FROM sqlite_master WHERE type='table';", con)

EDA

This section cleans the data and visualizes trends in payroll and wins.

# Query total salary, regular-season wins and losses, postseason wins,
# and a World Series flag for each team and year
base = pd.read_sql_query("""
WITH ps AS (
  -- collect game wins for both winners and losers of each postseason series
  SELECT yearID, teamIDwinner AS teamID, wins   AS gwins FROM SeriesPost
  UNION ALL
  SELECT yearID, teamIDloser  AS teamID, losses AS gwins FROM SeriesPost
),
ps_agg AS (
  SELECT yearID, teamID, SUM(gwins) AS postseason_wins
  FROM ps
  GROUP BY yearID, teamID
),
ws AS (
  SELECT yearID, teamIDwinner AS teamID, 1 AS won_ws
  FROM SeriesPost
  WHERE round = 'WS'
)
SELECT
    t.yearID,
    t.name,
    SUM(s.salary) AS total_salary,
    t.W AS wins,
    t.L AS losses,
    COALESCE(ps_agg.postseason_wins, 0) AS postseason_wins,   -- <-- total PS wins (all rounds)
    COALESCE(ws.won_ws, 0) AS won_world_series                 -- 1 if WS champ, else 0
FROM teams t
JOIN salaries s
  ON t.teamID = s.teamID
 AND t.yearID = s.yearID
LEFT JOIN ps_agg
  ON ps_agg.teamID = t.teamID
 AND ps_agg.yearID = t.yearID
LEFT JOIN ws
  ON ws.teamID = t.teamID
 AND ws.yearID = t.yearID
GROUP BY t.teamID, t.yearID, t.name, t.W, t.L, ps_agg.postseason_wins, ws.won_ws
ORDER BY t.yearID;
""", con)


# Calculate a wins per million dollars expended column
base['wins_per_million'] = base['wins'] / (base['total_salary'] / 1_000_000)

# Change won_world_series to boolean
base['won_world_series'] = base['won_world_series'].astype('bool')

# Convert Salary to millions
base['total_salary'] = base['total_salary'] / 1000000

The histogram below shows a clearly right-skewed distribution, as one would expect when money is involved. A relatively small number of team-seasons with very large payrolls form the long right tail and pull the mean salary expenditure above the median.
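
A quick numeric check for right skew is to compare the mean with the median: in a right-skewed sample the mean sits above the median. The sketch below uses made-up payroll figures, not values from the Lahman data; on the real table the same comparison would use `base['total_salary'].mean()` and `base['total_salary'].median()`.

```python
from statistics import mean, median

# Hypothetical payrolls in millions of dollars (illustration only,
# not taken from the Lahman database).
payrolls = [20, 25, 30, 35, 40, 45, 50, 120, 200]

avg = mean(payrolls)
mid = median(payrolls)

# The two large payrolls pull the mean well above the median.
print(f"mean = {avg:.1f}, median = {mid}")
```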

ggplot(base, aes(x="total_salary")) +\
  geom_histogram() +\
  labs(title="Total Salary (millions) Expenditure Distribution", y="Number of Teams", x='Total Salary (millions)')

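The next histogram shows regular-season wins for each team-season in the joined data, and the distribution looks approximately normal. Because the query inner-joins teams to salaries, only seasons with salary records are included (918 team-seasons in the models below), not the full 1871-2022 span of the database.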

ggplot(base, aes(x="wins")) +\
  geom_histogram() +\
  labs(title="Total Wins per Team & Year Distribution", y="Number of Teams", x='Total Wins in Regular Season')

The next boxplot compares total salary expenditure between two groups: teams that won the World Series and teams that did not. As one might guess, World Series winners tend to spend more on their players' salaries. Mean payroll was 49.4 million for teams that did not win and 71 million for teams that did.
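
Group means like these can be computed with a grouped aggregation. The sketch below runs on a small invented data frame that only mimics the shape of `base`, so its numbers are not the ones quoted in the text; replacing `toy` with `base` would give the real figures.

```python
import pandas as pd

# Toy stand-in for `base`; the values are made up for illustration only.
toy = pd.DataFrame({
    "won_world_series": [False, False, False, True, True],
    "total_salary": [40.0, 50.0, 60.0, 70.0, 90.0],  # millions of dollars
})

# Mean, median, and group size of payroll for non-winners and winners.
group_stats = (
    toy.groupby("won_world_series")["total_salary"]
       .agg(["mean", "median", "count"])
)
print(group_stats)
```

Reporting the median and the group size next to the mean is useful here because payroll is right-skewed and the winners' group is small.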

ggplot(base, aes(x="won_world_series", y='total_salary', fill='won_world_series')) +\
  geom_boxplot() +\
  scale_y_continuous(limits=(0,200)) +\
  labs(title="Total Salary (millions) Among World Series Champs", y="Total Salary Expended (million)", x='Won World Series')

The next boxplot compares wins per million dollars of salary between the same two groups. Interestingly, World Series winners had a lower mean wins per million, which indicates lower spending efficiency. One interpretation is quality over quantity: to win more, a team must pay more per win. This should be taken with a grain of salt, though, because hundreds of outliers sit above the box for teams that did not win.
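
Part of this pattern is arithmetic: wins are capped by the length of the season while payroll is not, so wins per million tends to fall as payroll rises even when the team wins more games. A small sketch with two hypothetical teams (the win totals and payrolls are invented):

```python
def wins_per_million(wins: int, payroll_millions: float) -> float:
    """Regular-season wins divided by payroll in millions of dollars."""
    return wins / payroll_millions

# Hypothetical teams, not taken from the data.
budget_team = wins_per_million(wins=80, payroll_millions=40)
big_spender = wins_per_million(wins=95, payroll_millions=120)

# The big spender wins 15 more games but gets fewer wins per million.
print(f"budget team: {budget_team:.2f} wins per $1M")
print(f"big spender: {big_spender:.2f} wins per $1M")
```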

ggplot(base, aes(x="won_world_series", y='wins_per_million', fill='won_world_series')) +\
  geom_boxplot() +\
  scale_y_continuous(limits=(0,10)) +\
  labs(title="Wins Per Million Salary Spent Among World Series Champs", y="Wins Per Million", x='Won World Series')

The visualizations above suggest relationships between salary expenditure, wins per million, and wins across the two groups (World Series winners versus everyone else). The next sections test whether these relationships are statistically significant and whether salary is predictive of success.

Model 1: Predicting Regular Season Wins from Salary

First, let’s address our core research question by predicting success metrics from salary spending, rather than the reverse.

# Model 1: Predicting Regular Season Wins from Salary
model_wins = smf.ols(
    "wins ~ total_salary",
    data=base
).fit(cov_type="HC3")

print(model_wins.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   wins   R-squared:                       0.065
Model:                            OLS   Adj. R-squared:                  0.064
Method:                 Least Squares   F-statistic:                     82.52
Date:                Sun, 28 Sep 2025   Prob (F-statistic):           6.31e-19
Time:                        16:30:47   Log-Likelihood:                -3540.2
No. Observations:                 918   AIC:                             7084.
Df Residuals:                     916   BIC:                             7094.
Df Model:                           1                                         
Covariance Type:                  HC3                                         
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       75.7688      0.639    118.658      0.000      74.517      77.020
total_salary     0.0695      0.008      9.084      0.000       0.055       0.085
==============================================================================
Omnibus:                        7.485   Durbin-Watson:                   1.847
Prob(Omnibus):                  0.024   Jarque-Bera (JB):                6.557
Skew:                          -0.143   Prob(JB):                       0.0377
Kurtosis:                       2.701   Cond. No.                         127.
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)

Interpretation of Regular Season Wins Model:

The regression results show that salary spending has a statistically significant relationship with regular season wins (p < 0.001). Here’s what the coefficients mean:

  • Intercept (75.77): The fitted value at a payroll of $0 million is about 76 wins. No team spends nothing, so this is an extrapolation that anchors the line rather than a realistic baseline
  • Salary Coefficient (0.0695): Each additional $1 million in payroll is associated with approximately 0.07 more wins
  • Practical Impact: A payroll that is $50 million higher corresponds to roughly 3.5 additional wins per season
  • R-squared (0.065): Salary spending explains only 6.5% of the variation in regular season wins

This reveals that while salary spending has a statistically significant positive association with wins, the relationship is much weaker than initially expected. The low R-squared indicates that salary is far from the dominant factor in team success.
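
The uncertainty around the practical impact can be read off the same summary table by scaling the slope's 95% confidence interval. A minimal sketch, with the coefficient and interval bounds copied by hand from the printed output (so they carry its rounding):

```python
# Values copied from the Model 1 summary (wins per $1M of payroll).
SLOPE = 0.0695
CI_LOW, CI_HIGH = 0.055, 0.085   # 95% interval, HC3 standard errors

increase = 50  # payroll increase in millions of dollars

point = SLOPE * increase
low = CI_LOW * increase
high = CI_HIGH * increase

print(f"${increase}M more payroll: about {point:.1f} wins "
      f"(95% interval roughly {low:.2f} to {high:.2f})")
```

Scaling the interval this way is valid because the predicted difference is a constant multiple of a single coefficient.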

Model 2: Predicting Postseason Wins from Salary

# Model 2: Predicting Postseason Wins from Salary
model_postseason = smf.ols(
    "postseason_wins ~ total_salary",
    data=base
).fit(cov_type="HC3")

print(model_postseason.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:        postseason_wins   R-squared:                       0.036
Model:                            OLS   Adj. R-squared:                  0.035
Method:                 Least Squares   F-statistic:                     25.86
Date:                Sun, 28 Sep 2025   Prob (F-statistic):           4.46e-07
Time:                        16:30:47   Log-Likelihood:                -2120.0
No. Observations:                 918   AIC:                             4244.
Df Residuals:                     916   BIC:                             4254.
Df Model:                           1                                         
Covariance Type:                  HC3                                         
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        0.3153      0.120      2.628      0.009       0.080       0.550
total_salary     0.0109      0.002      5.085      0.000       0.007       0.015
==============================================================================
Omnibus:                      521.433   Durbin-Watson:                   2.003
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2911.353
Skew:                           2.717   Prob(JB):                         0.00
Kurtosis:                       9.826   Cond. No.                         127.
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)

Interpretation of Postseason Performance:

Interestingly, the postseason model shows salary spending is statistically significant (p < 0.001) but with an even weaker practical relationship:

  • Intercept (0.32): The fitted value at a payroll of $0 million is about 0.32 postseason wins (an extrapolation, since no team spends nothing)
  • Salary Coefficient (0.0109): Each additional $1 million is associated with 0.011 more postseason wins
  • Practical Impact: A payroll that is $50 million higher corresponds to only 0.55 additional postseason wins
  • R-squared (0.036): Salary explains only 3.6% of postseason win variation

While statistically significant, this relationship is practically small. Note that teams that missed the postseason enter the model with 0 postseason wins, so the coefficient mixes the chance of qualifying with performance after qualifying. The extremely low R-squared indicates that salary spending tells us very little about postseason success.

Model 3: Predicting World Series Championships

# Model 3: Predicting World Series Championships (Logistic Regression)
base['won_world_series'] = base['won_world_series'].astype(int)

model_ws = smf.logit(
    "won_world_series ~ total_salary",
    data=base
).fit()

print(model_ws.summary())
Optimization terminated successfully.
         Current function value: 0.144936
         Iterations 8
                           Logit Regression Results                           
==============================================================================
Dep. Variable:       won_world_series   No. Observations:                  918
Model:                          Logit   Df Residuals:                      916
Method:                           MLE   Df Model:                            1
Date:                Sun, 28 Sep 2025   Pseudo R-squ.:                 0.01811
Time:                        16:30:47   Log-Likelihood:                -133.05
converged:                       True   LL-Null:                       -135.51
Covariance Type:            nonrobust   LLR p-value:                   0.02673
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       -3.9167      0.328    -11.926      0.000      -4.560      -3.273
total_salary     0.0083      0.004      2.337      0.019       0.001       0.015
================================================================================

Interpretation of Championship Success:

The logistic regression for World Series victories shows a statistically significant but practically weak relationship (p = 0.019):

  • Intercept (-3.92): The log-odds of winning the World Series at a payroll of $0 million, which corresponds to odds of about 0.02 (exp(-3.92)), or roughly a 2% chance
  • Salary Coefficient (0.0083): For every additional $1 million spent, the log-odds of winning increase by 0.0083
  • Odds Ratio: Each $1 million increase multiplies the odds of winning by approximately 1.0083 (very close to 1)
  • Practical Impact: A $50 million payroll increase raises the odds of winning by roughly 50% (exp(0.0083*50) ≈ 1.51 using the rounded coefficient). Because the baseline odds are so low, this is still a small change in probability
  • Pseudo R-squared (0.018): The model improves the log-likelihood over an intercept-only model by only 1.8%. This is McFadden's pseudo R-squared, not a share of variance explained

While statistically detectable, this relationship is the weakest of all three models. The extremely low pseudo R-squared indicates that salary spending has minimal predictive power for championship success.
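
Log-odds are hard to read directly, so it can help to convert the fitted values into probabilities with the logistic function p = 1 / (1 + exp(-(b0 + b1 * x))). The sketch below uses the rounded coefficients from the printed summary; the payroll levels are arbitrary examples and `win_probability` is a helper name introduced here.

```python
import math

# Coefficients copied from the Model 3 summary (rounded as printed).
INTERCEPT = -3.9167
SLOPE = 0.0083  # change in log-odds per additional $1M of payroll

def win_probability(payroll_millions: float) -> float:
    """Fitted probability of winning the World Series at a given payroll."""
    log_odds = INTERCEPT + SLOPE * payroll_millions
    return 1.0 / (1.0 + math.exp(-log_odds))

for payroll in (50, 100, 200):
    print(f"${payroll}M payroll: {win_probability(payroll):.1%} fitted chance")

# Multiplicative change in the odds for a $50M increase.
odds_ratio_50m = math.exp(SLOPE * 50)
print(f"odds ratio for +$50M: {odds_ratio_50m:.2f}")
```

Under these coefficients the fitted chance stays below 10% even at a $200M payroll, which puts the odds ratio in perspective.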

Practical Examples

# Calculate predictions for different payroll levels
low_salary_team = 50  # $50M
high_salary_team = 200  # $200M

# Look up coefficients by name rather than by position
intercept = model_wins.params["Intercept"]
slope = model_wins.params["total_salary"]

predicted_wins_low = intercept + slope * low_salary_team
predicted_wins_high = intercept + slope * high_salary_team

print(f"Low salary team (${low_salary_team}M): {predicted_wins_low:.1f} predicted wins")
print(f"High salary team (${high_salary_team}M): {predicted_wins_high:.1f} predicted wins")
print(f"Difference: {predicted_wins_high - predicted_wins_low:.1f} more wins")
Low salary team ($50M): 79.2 predicted wins
High salary team ($200M): 89.7 predicted wins
Difference: 10.4 more wins

Real-World Impact:

A team spending $200M versus $50M is predicted to win approximately 10.4 more games in the regular season. In a 162-game season, this is about 6.4 percentage points of winning percentage, which could mean the difference between making the playoffs and missing out. However, it comes at a cost of $150M in additional payroll, roughly $14.4 million per additional win.
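
The cost-per-win figure follows directly from the slope of Model 1. A short sketch of the arithmetic, using the rounded coefficient from the printed summary:

```python
SLOPE = 0.0695          # wins per additional $1M, from the Model 1 summary
GAMES_PER_SEASON = 162

low_payroll, high_payroll = 50, 200            # millions of dollars
extra_payroll = high_payroll - low_payroll     # 150

extra_wins = SLOPE * extra_payroll
cost_per_win = extra_payroll / extra_wins      # millions of dollars per win
win_pct_gain_points = 100 * extra_wins / GAMES_PER_SEASON

print(f"extra wins: {extra_wins:.1f}")
print(f"cost per extra win: ${cost_per_win:.1f}M")
print(f"winning percentage gain: {win_pct_gain_points:.1f} points")
```

Because the model is linear, the cost per win equals 1 / SLOPE regardless of which two payroll levels are compared.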

Answering the Research Question

Does salary expenditure relate to postseason success?

The evidence shows a nuanced relationship:

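  1. Regular Season: Statistically significant but modest relationship. Higher payroll predicts more wins (about 0.07 per $1 million) yet explains only 6.5% of the variation
  2. Postseason Performance: Weaker relationship. Payroll is a significant predictor of postseason wins, but the effect is small (R-squared of 0.036), and the model includes teams that never qualified, so it cannot separate reaching the playoffs from winning once there
  3. Championships: Weakest evidence. The association is detectable (p = 0.019) but has little predictive power

Conclusion: Higher payrolls are associated with somewhat more regular season wins, which is the route into the postseason, but the association weakens at each later stage. This is consistent with the idea that playoff success depends heavily on factors like team chemistry, coaching decisions, and situational performance that can't be easily purchased, although these models do not measure those factors directly.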

Model Limitations

Important Considerations:

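  • Time Period: The sample covers only seasons with salary records (918 team-seasons), not the database's full 1871-2022 span, and payroll is in nominal dollars pooled across seasons without adjusting for salary growth or structural changes in baseball
  • External Factors: MLB has no hard salary cap, but luxury tax and revenue sharing rules have evolved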
  • Unmeasured Variables: Player development systems, coaching quality, and organizational culture aren’t captured
  • Sample Size: Only one World Series winner per year limits championship analysis
  • Causation vs Correlation: Higher spending might reflect good management rather than causing success

These limitations suggest our findings should be interpreted as associations rather than definitive causal relationships.
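
One possible way to soften the time-period limitation is to express each payroll relative to the league average for that season, so that a value of 1.0 always means "average spender" regardless of the year. This is a hypothetical sketch on invented numbers, not part of the original analysis; the column name `relative_payroll` is introduced here for illustration.

```python
import pandas as pd

# Toy data (made up): three teams in each of two seasons, payroll in $M.
toy = pd.DataFrame({
    "yearID": [1990, 1990, 1990, 2015, 2015, 2015],
    "total_salary": [10.0, 20.0, 30.0, 80.0, 120.0, 160.0],
})

# Average payroll within each season, aligned back to every row.
season_mean = toy.groupby("yearID")["total_salary"].transform("mean")

# Payroll as a multiple of that season's average.
toy["relative_payroll"] = toy["total_salary"] / season_mean
print(toy)
```

A model such as `wins ~ relative_payroll` would then compare teams with their contemporaries instead of comparing dollar amounts across decades.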